Method and system for modeling a common-language speech recognition, by a computer, under the influence of a plurality of dialects

ABSTRACT

The present invention relates to a method for modeling a common-language speech recognition, by a computer, under the influence of multiple dialects, and belongs to the technical field of speech recognition by a computer. In this method, a triphone standard common-language model is first generated based on training data of a standard common language, and first and second monophone dialectal-accented common-language models are generated based on development data of dialectal-accented common languages of a first kind and a second kind, respectively. Then a temporary merged model is obtained by merging the first dialectal-accented common-language model into the standard common-language model according to a first confusion matrix, which is obtained by recognizing the development data of the first dialectal-accented common language using the standard common-language model. Finally, a recognition model is obtained by merging the second dialectal-accented common-language model into the temporary merged model according to a second confusion matrix, which is generated by recognizing the development data of the second dialectal-accented common language using the temporary merged model. This method enhances the operating efficiency and raises the recognition rate for the dialectal-accented common languages. The recognition rate for the standard common language is also raised.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a method, a system and a program for modeling a common-language speech recognition, by a computer, under the influence of multiple dialects, and also relates to a recording medium that stores the program. The present invention particularly relates to a field of speech recognition by a computer.

2. Description of the Related Art

Enhancing robustness has been an important issue and a difficult point to achieve in the field of speech recognition. A major factor of deterioration in robustness of speech recognition lies in a problem involving linguistic accents. For example, the Chinese language has many dialects, which leads to a significant problem of accents. The problem gives incentives for ongoing research activities. In the conventional speech recognition system, the recognition rate for a standard common language is high but the recognition rate for an accented common language influenced by dialects (hereinafter referred to as “dialectal-accented common language” or simply as “dialectal common language” also) is low. To address this problem, a method such as “adaptation” may be used as a countermeasure in general. However, a precondition in this case is that a sufficient amount of data for the dialectal-accented common language must be provided. With this method, there are cases where the recognition rate for the standard common language drops markedly. Since there are many kinds of dialects, the work efficiency is degraded if an acoustic model is trained repeatedly for the respective kinds of dialects.

SUMMARY OF THE INVENTION

The present invention has been made in view of the foregoing problems, and one of its purposes is to provide a method for modeling a common-language speech recognition, by a computer, under the influence of a plurality of dialects, the method being capable of raising the recognition rate for dialectal-accented common languages with a small amount of data while sustaining the recognition rate for the standard common language, and to provide a system therefor.

A method, for modeling a common-language speech recognition by a computer under the influence of a plurality of dialects, includes the following steps of:

(1) generating a triphone standard common-language model based on training data of standard common language, generating a first monophone dialectal-accented common-language model based on development data of dialectal-accented common language of first kind, and generating a second monophone dialectal-accented common-language model based on development data of dialectal-accented common language of second kind;

(2) generating a first confusion matrix by recognizing the development data of the dialectal-accented common language of first kind using the standard common-language model, and obtaining a temporary merged model in a manner that the first dialectal-accented common-language model is merged into the standard common-language model according to the first confusion matrix; and

(3) generating a second confusion matrix by recognizing the development data of the dialectal-accented common language of second kind using the temporary merged model, and obtaining a recognition model in a manner that the second dialectal-accented common-language model is merged into the temporary merged model according to the second confusion matrix.

The merging method as described in the above steps (2) and (3) is such that:

a probability density function of the temporary merged model is expressed by

p′(x|s) = λ₁ p(x|s) + (1−λ₁) p(x|d₁) p(d₁|s)

where x is an observation feature vector of speech to be recognized, s is a hidden Markov state in the standard common-language model, d₁ is a hidden Markov state in the first dialectal-accented common-language model, and λ₁ is a linear interpolating coefficient such that 0<λ₁<1, and

wherein a probability density function of the merged recognition model is expressed by

$p''(x|s) = \sum_{k=1}^{K} w_k^{(sc)'} N_k^{(sc)}(\cdot) + \sum_{m=1}^{M} \sum_{n=1}^{N} w_{mn}^{(dc1)'} N_{mn}^{(dc1)}(\cdot) + \sum_{p=1}^{P} \sum_{q=1}^{Q} w_{pq}^{(dc2)'} N_{pq}^{(dc2)}(\cdot)$

where $w_k^{(sc)'}$ is a mixture weight for the hidden Markov state of the standard common-language model, $w_{mn}^{(dc1)'}$ is a mixture weight for the hidden Markov state of the first dialectal-accented common-language model, $w_{pq}^{(dc2)'}$ is a mixture weight for the hidden Markov state of the second dialectal-accented common-language model, K is the number of Gaussian mixture components for Hidden Markov Model state s in the standard common-language model, $N_k^{(sc)}(\cdot)$ is a Gaussian mixture component for Hidden Markov Model state s, M is the number of states d₁ that are considered pronunciation variants occurring between the first dialectal-accented common-language model and the standard common-language model, N is the number of Gaussian mixture components for Hidden Markov Model state d₁ in the first dialectal-accented common-language model, $N_{mn}^{(dc1)}(\cdot)$ is a Gaussian mixture component for Hidden Markov Model state d₁, P is the number of states d₂ that are considered pronunciation variants occurring between the second dialectal-accented common-language model and the standard common-language model, Q is the number of Gaussian mixture components for Hidden Markov Model state d₂ in the second dialectal-accented common-language model, and $N_{pq}^{(dc2)}(\cdot)$ is a Gaussian mixture component for Hidden Markov Model state d₂.

The method, for modeling a common-language speech recognition by a computer under the influence of a plurality of dialects, according to the above embodiment achieves the following advantageous effects.

Each of a plurality of dialectal-accented common-language models is merged into a standard common-language model using an iterative method, so that the redundant operation of training an acoustic model for each of the dialects can be avoided and therefore the work efficiency can be enhanced. Also, according to this method, the recognition rate for dialectal-accented common languages can be raised. At the same time, the recognition rate for the standard common language does not deteriorate and sometimes increases. Thus, this method resolves a problem found in other conventional methods, where the recognition rate for the standard common language markedly deteriorates while a dialectal-accented common language is properly treated.

Optional combinations of the aforementioned processes, and implementations of the invention in the form of apparatuses, systems, recording media, computer programs and so forth may also be practiced as additional modes of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments will now be described by way of examples only, with reference to the accompanying drawings which are meant to be exemplary, not limiting, and wherein like elements are numbered alike in several Figures in which:

FIG. 1 conceptually shows a principle of a modeling method according to an embodiment; and

FIG. 2 is a block diagram showing an example of a modeling system that realizes a modeling method as shown in FIG. 1.

DETAILED DESCRIPTION OF THE INVENTION

A description is now given of preferred embodiments of the present invention with reference to drawings.

FIG. 1 conceptually shows the principle of a method for modeling a speech recognition of common language under the influence of n kinds of dialects (n being an integer greater than or equal to 2) according to an embodiment of the present invention. This modeling method includes the following three steps of:

(1) generating a triphone standard common-language model based on training data of standard common language, and generating first to nth monophone dialectal-accented common-language models for respective corresponding dialectal-accented common languages of first to nth kinds, based on the development data thereof;

(2) generating a first confusion matrix by recognizing the development data of the dialectal-accented common language of first kind using the standard common-language model, and obtaining a first temporary merged model in a manner that the first dialectal-accented common-language model is merged into the standard common-language model according to the first confusion matrix; and

(3) generating an ith confusion matrix by recognizing the development data of dialectal-accented common language of ith kind using an (i−1)th temporary merged model (i being an integer such that 2≦i≦n), and obtaining a final recognition model by repeating, from i=2 to i=n, an operation of merging the ith dialectal-accented common-language model into the (i−1)th temporary merged model according to the ith confusion matrix.
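The control flow of steps (2) and (3) can be sketched as a simple loop. The following is a minimal sketch only: the function name `build_recognition_model` and the two callables `make_confusion` and `merge` are hypothetical placeholders for the recognition and merging operations, whose internals the embodiment describes separately.

```python
from typing import Any, Callable, Sequence


def build_recognition_model(
    standard_model: Any,
    dialect_models: Sequence[Any],
    dev_sets: Sequence[Any],
    make_confusion: Callable[[Any, Any], Any],
    merge: Callable[[Any, Any, Any], Any],
) -> Any:
    """Iteratively merge n dialectal-accented models into the standard model.

    make_confusion(model, dev_data) and merge(model, dialect_model, confusion)
    are hypothetical callables supplied by the caller.
    """
    if len(dialect_models) != len(dev_sets):
        raise ValueError("one development set is needed per dialect model")

    merged = standard_model  # acts as the 0th temporary merged model
    for dialect_model, dev_data in zip(dialect_models, dev_sets):
        # The ith confusion matrix is produced by recognizing the ith
        # development data with the (i-1)th temporary merged model.
        confusion = make_confusion(merged, dev_data)
        # The ith dialect model is merged into the (i-1)th merged model.
        merged = merge(merged, dialect_model, confusion)
    return merged  # after the nth merge this is the recognition model
```

Because each confusion matrix is computed with the most recent merged model, the order in which the dialects are processed can affect the result; the sketch simply follows the order of the input sequences.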

FIG. 2 is a block diagram showing a system for modeling the aforementioned speech recognition of a common language under the influence of a plurality of dialects. A modeling system according to the present embodiment comprises a model generation unit 100 and a control unit 200. Referring to FIG. 2, the model generation unit 100 includes a training database (hereinafter abbreviated as “training DB” also) 10-0, development databases (hereinafter abbreviated as “development DB” also) 10-1 to 10-n, model generators 30-0 to 30-n, confusion matrix generators 40-1 to 40-n, and model merging units 50-1 to 50-n.

The training DB 10-0 is a database that stores the training data of a standard common language.

The development DBs 10-1 to 10-n are databases that store the development data of dialectal-accented common languages of first to nth kinds, respectively.

The model generator 30-0 is used to generate a triphone standard common-language model based on the training data of the standard common language stored in the training DB 10-0.

The model generators 30-1 to 30-n are a sequence of blocks that generate first to nth monophone dialectal-accented common-language models based on the development data of dialectal-accented common languages of first to nth kinds stored in the development databases 10-1 to 10-n, respectively.

The confusion matrix generators 40-1 to 40-n are a sequence of blocks that generate first to nth confusion matrices by recognizing the development data of the dialectal-accented common languages of first to nth kinds stored in the first to nth development databases 10-1 to 10-n using the models generated by the corresponding model generators 30-0 to 30-(n-1).

The model merging unit 50-1 generates a first temporary merged model in a manner that the first dialectal-accented common-language model generated by the model generator 30-1 is merged into a standard common-language model generated by the model generator 30-0 according to the first confusion matrix generated by the confusion matrix generator 40-1. The model merging units 50-2 to 50-(n-1) generate second to (n−1)th temporary merged models in a manner that the second to (n−1)th dialectal-accented common-language models generated by the model generators 30-2 to 30-(n-1) are each merged into a temporary merged model generated by a model merging unit placed immediately prior thereto according to the second to (n−1)th confusion matrices generated by the corresponding confusion matrix generators 40-2 to 40-(n-1).

The model merging unit 50-n finally generates a recognition model in a manner that the nth dialectal-accented common-language model generated by the model generator 30-n is merged into the (n−1)th temporary merged model generated by the model merging unit 50-(n-1) placed immediately prior thereto according to the nth confusion matrix generated by the confusion matrix generator 40-n.

The control unit 200 controls the model generation unit 100 in such a manner as to operate according to the aforementioned modeling method.

In FIG. 2, the training DB 10-0 and the development DBs 10-1 to 10-n are depicted as separate blocks. However, they may be configured as a single database or a plurality of databases that store training data of a standard common language and development data of dialectal-accented common languages of first to nth kinds. Also, the model generators 30-0 to 30-n are depicted as separate blocks in FIG. 2, but they may be configured as a single entity or a plurality of model generators, and the single or plurality of model generators may be used in a time sharing manner, based on a control performed by the control unit 200. Although the confusion matrix generators 40-1 to 40-n are depicted as separate blocks in FIG. 2, they may be configured as a single entity or a plurality of confusion matrix generators, and the single or plurality of confusion matrix generators may be used in a time sharing manner, based on a control performed by the control unit 200. Although the model merging units 50-1 to 50-n are depicted as separate blocks in FIG. 2, they may be configured as a single entity or a plurality of model merging units, and the single or plurality of model merging units may be used in a time sharing manner, based on a control performed by the control unit 200.

A concrete description is hereinbelow given of a method for modeling a recognition model capable of being compatible with two different kinds of dialectal-accented common languages (n=2).

This modeling method includes the following steps of:

(1) generating a triphone standard common-language model based on training data of standard common language, generating a first monophone dialectal-accented common-language model based on development data of dialectal-accented common language of first kind, and generating a second monophone dialectal-accented common-language model based on development data of dialectal-accented common language of second kind;

(2) acquiring a first confusion matrix by recognizing the development data of the dialectal-accented common language of first kind using the standard common-language model, and obtaining a temporary merged model in a manner that the first dialectal-accented common-language model is merged into the standard common-language model according to the first confusion matrix; and

(3) acquiring a second confusion matrix by recognizing the development data of the dialectal-accented common language of second kind using the temporary merged model, and obtaining a recognition model in a manner that the second dialectal-accented common-language model is merged into the temporary merged model according to the second confusion matrix.
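The embodiment does not spell out how a confusion matrix is computed from the recognition output. One plausible reading, shown here only as a sketch under stated assumptions, is to count how often each standard-model unit s is aligned with each dialectal unit d on the development data and to normalize each row so that it estimates P(d|s). The function name `estimate_confusion`, the alignment format (a list of (s, d) pairs), and the unit labels in the usage example are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def estimate_confusion(
    aligned_pairs: Iterable[Tuple[Hashable, Hashable]],
) -> Dict[Hashable, Dict[Hashable, float]]:
    """Estimate P(d | s) from aligned (standard unit, dialectal unit) pairs.

    Assumption: the pairs come from aligning the recognizer output with the
    reference labels of the development data; how that alignment is obtained
    is outside this sketch.
    """
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for std_unit, dialect_unit in aligned_pairs:
        counts[std_unit][dialect_unit] += 1

    confusion: Dict[Hashable, Dict[Hashable, float]] = {}
    for std_unit, row in counts.items():
        total = sum(row.values())
        # Row-normalize so that the entries for a given s sum to one.
        confusion[std_unit] = {d: c / total for d, c in row.items()}
    return confusion


# Usage with made-up labels: "s1" is realized as "d1a" twice and "d1b" once.
example = estimate_confusion([("s1", "d1a"), ("s1", "d1a"), ("s1", "d1b"), ("s2", "d1b")])
```

Row normalization is what later lets the merged mixture weights sum to one for each state s.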

The merging method as described in the above steps (2) and (3) is such that:

the probability density function of the temporary merged model is expressed by

p′(x|s) = λ₁ p(x|s) + (1−λ₁) p(x|d₁) p(d₁|s)

where x is an observation feature vector of speech to be recognized, s is a hidden Markov state in the standard common-language model, d₁ is a hidden Markov state in the first dialectal-accented common-language model, and λ₁ is a linear interpolating coefficient such that 0<λ₁<1.
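The temporary merged density can be evaluated numerically as in the following minimal sketch. It assumes one-dimensional Gaussian components for brevity (a real acoustic model would use multivariate feature vectors), and it sums the dialectal term over all variant states d₁ associated with s, as the expanded derivation of this embodiment does. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Gmm = List[Tuple[float, float, float]]  # (weight, mean, variance) per component


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x: float, gmm: Gmm) -> float:
    """Density of a Gaussian mixture, i.e. p(x|s) or p(x|d1)."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in gmm)


def temporary_merged_pdf(
    x: float,
    std_gmm: Gmm,
    variants: List[Tuple[float, Gmm]],
    lam1: float,
) -> float:
    """p'(x|s) = lam1 * p(x|s) + (1 - lam1) * sum over d1 of p(x|d1) * P(d1|s).

    variants holds (P(d1|s), GMM of d1) for each variant state d1 of s.
    """
    if not 0.0 < lam1 < 1.0:
        raise ValueError("lam1 must satisfy 0 < lam1 < 1")
    dialect_term = sum(p * gmm_pdf(x, gmm) for p, gmm in variants)
    return lam1 * gmm_pdf(x, std_gmm) + (1.0 - lam1) * dialect_term
```

When the confusion probabilities P(d₁|s) sum to one over the variants of s, the result is again a properly normalized density, so the merged state can be used in place of the original state without rescaling.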

Also, the probability density function of the recognition model is expressed by

$p''(x|s) = \sum_{k=1}^{K} w_k^{(sc)'} N_k^{(sc)}(\cdot) + \sum_{m=1}^{M} \sum_{n=1}^{N} w_{mn}^{(dc1)'} N_{mn}^{(dc1)}(\cdot) + \sum_{p=1}^{P} \sum_{q=1}^{Q} w_{pq}^{(dc2)'} N_{pq}^{(dc2)}(\cdot)$

where $w_k^{(sc)'}$ is a mixture weight for the hidden Markov state of the standard common-language model, $w_{mn}^{(dc1)'}$ is a mixture weight for the hidden Markov state of the first dialectal-accented common-language model, $w_{pq}^{(dc2)'}$ is a mixture weight for the hidden Markov state of the second dialectal-accented common-language model, K is the number of Gaussian mixture components for Hidden Markov Model state s in the standard common-language model, $N_k^{(sc)}(\cdot)$ is a Gaussian mixture component for Hidden Markov Model state s, M is the number of states d₁ that are considered pronunciation variants occurring between the first dialectal-accented common-language model and the standard common-language model, N is the number of Gaussian mixture components for Hidden Markov Model state d₁ in the first dialectal-accented common-language model, $N_{mn}^{(dc1)}(\cdot)$ is a Gaussian mixture component for Hidden Markov Model state d₁, P is the number of states d₂ that are considered pronunciation variants occurring between the second dialectal-accented common-language model and the standard common-language model, Q is the number of Gaussian mixture components for Hidden Markov Model state d₂ in the second dialectal-accented common-language model, and $N_{pq}^{(dc2)}(\cdot)$ is a Gaussian mixture component for Hidden Markov Model state d₂.

The method according to the present embodiment is characterized in that models created based on various kinds of dialectal-accented data are merged into the standard common-language model in an iterative manner. The fundamental flow of this method is illustrated in FIG. 1. In the case of merging two different dialectal-accented common-language models into the standard common-language model using the flow in FIG. 1, the probability density function of a temporary merged model can be expressed by

p′(x|s) = λ₁ p(x|s) + (1−λ₁) p(x|d₁) p(d₁|s).

In the above equation, x is an observation feature vector of speech to be recognized, s is a hidden Markov state in the standard common-language model, and d₁ is a hidden Markov state in the first dialectal-accented common-language model. λ₁ is a linear interpolating coefficient such that 0<λ₁<1, and indicates a mixture weight in the temporary merged model. In the actual setting, the optimum λ₁ is determined through experiments. p(d₁|s) is the output probability of the hidden Markov state in the first dialectal-accented common-language model given the corresponding hidden Markov state in the standard common-language model and indicates a variation of pronunciations in the dialect of first kind relative to the standard common language. For the same reasoning, the probability density function of the final merged model may be expressed by

$\begin{aligned}
p''(x|s) &= \lambda_2\, p'(x|s) + (1-\lambda_2)\, p(x|d_2)\, p'(d_2|s) \\
&= \lambda_2 \lambda_1\, p(x|s) + \lambda_2 (1-\lambda_1)\, p(x|d_1)\, p(d_1|s) + (1-\lambda_2)\, p(x|d_2)\, p'(d_2|s) \\
&= \lambda_2 \lambda_1 \sum_{k=1}^{K} w_k^{(sc)} N_k^{(sc)}(\cdot) + \lambda_2 (1-\lambda_1) \sum_{m=1}^{M} P(d_{1m}|s) \cdot \sum_{n=1}^{N} w_{mn}^{(dc1)} N_{mn}^{(dc1)}(\cdot) + (1-\lambda_2) \sum_{p=1}^{P} P(d_{2p}|s) \cdot \sum_{q=1}^{Q} w_{pq}^{(dc2)} N_{pq}^{(dc2)}(\cdot) \\
&= \sum_{k=1}^{K} \lambda_2 \lambda_1 w_k^{(sc)} N_k^{(sc)}(\cdot) + \sum_{m=1}^{M} \sum_{n=1}^{N} \lambda_2 (1-\lambda_1) \cdot P(d_{1m}|s) \cdot w_{mn}^{(dc1)} N_{mn}^{(dc1)}(\cdot) + \sum_{p=1}^{P} \sum_{q=1}^{Q} (1-\lambda_2) \cdot P(d_{2p}|s) \cdot w_{pq}^{(dc2)} N_{pq}^{(dc2)}(\cdot) \\
&= \sum_{k=1}^{K} w_k^{(sc)'} N_k^{(sc)}(\cdot) + \sum_{m=1}^{M} \sum_{n=1}^{N} w_{mn}^{(dc1)'} N_{mn}^{(dc1)}(\cdot) + \sum_{p=1}^{P} \sum_{q=1}^{Q} w_{pq}^{(dc2)'} N_{pq}^{(dc2)}(\cdot)
\end{aligned}$

where d₂ is a hidden Markov state in the second dialectal-accented common-language model, and λ₂ is a linear interpolating coefficient such that 0<λ₂<1, which indicates a mixture weight in the final merged model. In the actual setting, the optimum λ₂ is determined through experiments. K is the number of Gaussian mixture components for Hidden Markov Model state s in the standard common-language model. $N_k^{(sc)}(\cdot)$ is a Gaussian mixture component for Hidden Markov Model state s. M is the number of states d₁ that are considered pronunciation variants occurring between the first dialectal-accented common-language model and the standard common-language model; N is the number of Gaussian mixture components for Hidden Markov Model state d₁ in the first dialectal-accented common-language model. $N_{mn}^{(dc1)}(\cdot)$ is a Gaussian mixture component for Hidden Markov Model state d₁. $P(d_{1m}|s)$ is the corresponding pronunciation-modeling probability. P is the number of states d₂ that are considered pronunciation variants occurring between the second dialectal-accented common-language model and the standard common-language model; Q is the number of Gaussian mixture components for Hidden Markov Model state d₂ in the second dialectal-accented common-language model. $N_{pq}^{(dc2)}(\cdot)$ is a Gaussian mixture component for Hidden Markov Model state d₂. $P(d_{2p}|s)$ is the corresponding pronunciation-modeling probability.

It is easy to see from the last line of the above equation that the final merged model is actually constructed by taking the weighted sum of the standard common-language model, the first dialectal-accented model and the second dialectal-accented model. $w_k^{(sc)'}$, $w_{mn}^{(dc1)'}$ and $w_{pq}^{(dc2)'}$ indicate the mixture weights of the three models represented by the above equation. Since the confusion matrices $P(d_{1m}|s)$ and $P(d_{2p}|s)$ and the interpolating coefficients λ₁ and λ₂ are already known, the weights for the mixture of normal distributions of the three models can be easily determined.
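Reading the weights off the last two lines of the derivation gives w_k^(sc)′ = λ₂λ₁·w_k^(sc), w_mn^(dc1)′ = λ₂(1−λ₁)·P(d_1m|s)·w_mn^(dc1) and w_pq^(dc2)′ = (1−λ₂)·P(d_2p|s)·w_pq^(dc2). The following sketch computes them for one state s; the function name and argument layout are hypothetical, and the numbers in the usage example are made up for illustration only.

```python
from typing import List, Tuple


def final_mixture_weights(
    w_sc: List[float],
    conf1: List[float],
    w_dc1: List[List[float]],
    conf2: List[float],
    w_dc2: List[List[float]],
    lam1: float,
    lam2: float,
) -> Tuple[List[float], List[List[float]], List[List[float]]]:
    """Return the primed weights of the final merged model for one state s.

    w_sc[k]     : original weights of the standard-model components
    conf1[m]    : P(d_1m | s) for each first-dialect variant state
    w_dc1[m][n] : original weights of the components of variant state d_1m
    conf2[p], w_dc2[p][q] : the same quantities for the second dialect
    """
    w_sc_new = [lam2 * lam1 * w for w in w_sc]
    w_dc1_new = [
        [lam2 * (1.0 - lam1) * p * w for w in row] for p, row in zip(conf1, w_dc1)
    ]
    w_dc2_new = [[(1.0 - lam2) * p * w for w in row] for p, row in zip(conf2, w_dc2)]
    return w_sc_new, w_dc1_new, w_dc2_new


# Illustrative values only (not taken from the experiments).
sc, dc1, dc2 = final_mixture_weights(
    w_sc=[0.5, 0.5],
    conf1=[0.75, 0.25],
    w_dc1=[[1.0], [0.4, 0.6]],
    conf2=[1.0],
    w_dc2=[[0.2, 0.8]],
    lam1=0.8,
    lam2=0.9,
)
total = sum(sc) + sum(map(sum, dc1)) + sum(map(sum, dc2))
```

With row-normalized confusion probabilities and normalized component weights, `total` equals one, because λ₂λ₁ + λ₂(1−λ₁) + (1−λ₂) = 1.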

A description is now given of exemplary embodiments:

TABLE 1 (Description of experimental data)

Data set | Database | Details
Training set of standard common language | Training data of standard common language | 120 speakers, 200 long sentences per speaker
Test set of standard common language | Test data of standard common language | 12 speakers, 100 commands per speaker
Development set of Chuan common language | Development data of Chuan dialectal common language | 20 speakers, 50 long sentences per speaker
Test set of Chuan common language | Test data of Chuan dialectal common language | 15 speakers, 75 commands per speaker
Development set of Minnan common language | Development data of Minnan dialectal common language | 20 speakers, 50 long sentences per speaker
Test set of Minnan common language | Test data of Minnan dialectal common language | 15 speakers, 75 commands per speaker

As evident from Table 1, data are divided into the standard common language, the Chuan (an abbreviation of Sichuan dialect) dialectal common language, and the Minnan dialectal common language, and the data are also divided into two parts, namely data for training/development and data for testing.

Baseline:

TABLE 2 (Description of a test baseline system): Word Error Rate (WER) by test set

Recognition model | Test set of standard common language | Test set of Minnan dialectal common language | Test set of Chuan dialectal common language
Mixed training recognition model | 8.5% | 21.7% | 21.1%

A mixed training recognition model is used in the baseline. This mixed training recognition model is trained based on the total of three kinds of data (standard and 2 dialectal).

Results of experiments:

TABLE 3 (Results of experiments): Word Error Rate (WER) by test set

Recognition model | Test set of standard common language | Test set of Minnan dialectal common language | Test set of Chuan dialectal common language
Recognition model according to the present embodiment | 6.3% | 11.2% | 15.0%

As evident from Table 3, the use of a model trained by employing the method of calculation according to the present embodiment clearly improves the recognition rate for both dialects relative to the baseline of Table 2. At the same time, the recognition rate for the standard common language is significantly improved, with the word error rate on the standard common-language test set falling from 8.5% to 6.3%. Thus the methods according to the above-described embodiment prove viable and effective.
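The improvement can also be expressed as a relative reduction in word error rate, computed from the figures of Tables 2 and 3. The following is a small arithmetic sketch; the function name is hypothetical, and the two dialectal test sets are identified only by the order in which their columns appear in the tables.

```python
from typing import List


def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Fraction of the baseline word error rate that is removed."""
    return (baseline_wer - proposed_wer) / baseline_wer


# WER values in percent, in the column order of Tables 2 and 3.
TEST_SETS: List[str] = [
    "standard",
    "dialectal (first listed)",
    "dialectal (second listed)",
]
BASELINE_WER: List[float] = [8.5, 21.7, 21.1]
PROPOSED_WER: List[float] = [6.3, 11.2, 15.0]

reductions = [
    relative_wer_reduction(b, p) for b, p in zip(BASELINE_WER, PROPOSED_WER)
]
for name, r in zip(TEST_SETS, reductions):
    print(f"{name}: {100.0 * r:.1f}% relative WER reduction")
```

On these figures the relative reductions are roughly 26% for the standard test set and roughly 48% and 29% for the two dialectal test sets.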

Further, according to the above-described methods, the final recognition model can be obtained by iteratively merging each dialectal-accented common-language model into the standard common-language model.
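For n dialects, repeating the same interpolation gives each constituent model an overall weight that is a product of interpolating coefficients. The sketch below extends the two-dialect derivation under the assumption that the ith merge uses a coefficient λᵢ in the same linear form; the general-n expression is an extrapolation from the n=2 case, and the function name is hypothetical.

```python
from typing import List, Sequence


def model_level_weights(lambdas: Sequence[float]) -> List[float]:
    """Overall weight of each constituent model after all merges.

    Returns [standard, dialect 1, ..., dialect n], where lambdas[i - 1] is
    the interpolating coefficient used when merging the ith dialect model.
    """
    weights = [1.0]  # before any merge, only the standard model is present
    for lam in lambdas:
        if not 0.0 < lam < 1.0:
            raise ValueError("each coefficient must satisfy 0 < lambda < 1")
        # Everything merged so far is scaled by lam ...
        weights = [lam * w for w in weights]
        # ... and the newly merged dialect model receives (1 - lam).
        weights.append(1.0 - lam)
    return weights


# For n = 2 this reproduces lam2*lam1, lam2*(1 - lam1) and (1 - lam2).
two_dialects = model_level_weights([0.8, 0.9])
```

One consequence visible in the sketch is that every later merge scales down the share of all models merged earlier, so the choice of each λᵢ also affects how much the standard common-language model contributes to the final recognition model.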

1. A method for modeling a speech recognition of common language under the influence of a plurality of dialects, the method including: (1) generating a triphone standard common-language model based on training data of standard common language, generating a first monophone dialectal-accented common-language model based on development data of dialectal-accented common language of first kind, and generating a second monophone dialectal-accented common-language model based on development data of dialectal-accented common language of second kind; (2) generating a first confusion matrix by recognizing the development data of the dialectal-accented common language of first kind using the standard common-language model, and obtaining a temporary merged model in a manner that the first dialectal-accented common-language model is merged into the standard common-language model according to the first confusion matrix; and (3) generating a second confusion matrix by recognizing the development data of the dialectal-accented common language of second kind using the temporary merged model, and obtaining a recognition model in a manner that the second dialectal-accented common-language model is merged into the temporary merged model according to the second confusion matrix.
2. A modeling method according to claim 1, wherein a probability density function of the temporary merged model is expressed by

p′(x|s) = λ₁ p(x|s) + (1−λ₁) p(x|d₁) p(d₁|s)

where x is an observation feature vector of voice to be recognized, s is a hidden Markov state in the standard common-language model, d₁ is a hidden Markov state in the first dialectal-accented common-language model, and λ₁ is a linear interpolating coefficient such that 0<λ₁<1, and wherein a probability density function of the recognition model is expressed by

$p''(x|s) = \sum_{k=1}^{K} w_k^{(sc)'} N_k^{(sc)}(\cdot) + \sum_{m=1}^{M} \sum_{n=1}^{N} w_{mn}^{(dc1)'} N_{mn}^{(dc1)}(\cdot) + \sum_{p=1}^{P} \sum_{q=1}^{Q} w_{pq}^{(dc2)'} N_{pq}^{(dc2)}(\cdot)$

where $w_k^{(sc)'}$ is a mixture weight for the hidden Markov state of the standard common-language model, $w_{mn}^{(dc1)'}$ is a mixture weight for the hidden Markov state of the first dialectal-accented common-language model, $w_{pq}^{(dc2)'}$ is a mixture weight for the hidden Markov state of the second dialectal-accented common-language model, K is the number of Gaussian mixtures for Hidden Markov Model state s in the standard common-language model, $N_k^{(sc)}(\cdot)$ is an element of Gaussian mixture for Hidden Markov Model state s, M is the number of d₁ that is considered as the pronunciation variants occurring between the first dialectal-accented common-language model for d₁ and the standard common-language model, N is the number of Gaussian mixtures for Hidden Markov Model state d₁ in the first dialectal-accented common-language model, $N_{mn}^{(dc1)}(\cdot)$ is an element of Gaussian mixture for Hidden Markov Model state d₁, P is the number of d₂ that is considered as the pronunciation variants occurring between the second dialectal-accented model for d₂ and the standard common-language model, Q is the number of Gaussian mixtures for Hidden Markov Model state d₂ in the second dialectal-accented model, $N_{pq}^{(dc2)}(\cdot)$ is an element of Gaussian mixture for Hidden Markov Model state d₂.
3. A computer readable medium encoded with a program, executable by a computer, for modeling a common-language speech recognition under the influence of a plurality of dialects, the program including the functions of: (1) generating a triphone standard common-language model based on training data of standard common language, generating a first monophone dialectal-accented common-language model based on development data of dialectal-accented common language of first kind, and generating a second monophone dialectal-accented common-language model based on development data of dialectal-accented common language of second kind; (2) generating a first confusion matrix by recognizing the development data of the dialectal-accented common language of first kind using the standard common-language model, and obtaining a temporary merged model in a manner that the first dialectal-accented common-language model is merged into the standard common-language model according to the first confusion matrix; and (3) generating a second confusion matrix by recognizing the development data of the dialectal-accented common language of second kind using the temporary merged model, and obtaining a recognition model in a manner that the second dialectal-accented common-language model is merged into the temporary merged model according to the second confusion matrix.
4. A computer readable medium according to claim 3, wherein a probability density function of the temporary merged model is expressed by

$$p'(x \mid s) = \lambda_1\, p(x \mid s) + (1 - \lambda_1)\, p(x \mid d_1)\, p(d_1 \mid s)$$

where x is an observation feature vector of voice to be recognized, s is a hidden Markov state in the standard common-language model, d₁ is a hidden Markov state in the first dialectal-accented common-language model, and λ₁ is a linear interpolating coefficient such that 0<λ₁<1, and wherein a probability density function of the recognition model is expressed by

$$p''(x \mid s) = \sum_{k=1}^{K} w_{k}^{(sc)'} N_{k}^{(sc)}(\cdot) + \sum_{m=1}^{M} \sum_{n=1}^{N} w_{mn}^{(dc1)'} N_{mn}^{(dc1)}(\cdot) + \sum_{p=1}^{P} \sum_{q=1}^{Q} w_{pq}^{(dc2)'} N_{pq}^{(dc2)}(\cdot)$$

where $w_{k}^{(sc)'}$ is a mixture weight for the hidden Markov state of the standard common-language model, $w_{mn}^{(dc1)'}$ is a mixture weight for the hidden Markov state of the first dialectal-accented common-language model, $w_{pq}^{(dc2)'}$ is a mixture weight for the hidden Markov state of the second dialectal-accented common-language model, K is the number of Gaussian mixtures for Hidden Markov Model state s in the standard common-language model, $N_{k}^{(sc)}(\cdot)$ is an element of Gaussian mixture for Hidden Markov Model state s, M is the number of d₁ that is considered as the pronunciation variants occurring between the first dialectal-accented common-language model for d₁ and the standard common-language model, N is the number of Gaussian mixtures for Hidden Markov Model state d₁ in the first dialectal-accented common-language model, $N_{mn}^{(dc1)}(\cdot)$ is an element of Gaussian mixture for Hidden Markov Model state d₁, P is the number of d₂ that is considered as the pronunciation variants occurring between the second dialectal-accented model for d₂ and the standard common-language model, Q is the number of Gaussian mixtures for Hidden Markov Model state d₂ in the second dialectal-accented model, and $N_{pq}^{(dc2)}(\cdot)$ is an element of Gaussian mixture for Hidden Markov Model state d₂.
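The two density functions of claim 4 can be pictured as successive re-weightings of Gaussian mixture components. The Python sketch below is a minimal illustration under stated assumptions, not the claimed implementation: Gaussians are one-dimensional, the interpolation is summed over all dialect states d treated as variants of s, the merged weights are taken as λ·w for the existing components and (1−λ)·p(d|s)·w for the added ones, and the second merge uses a second coefficient (here 0.9) analogous to λ₁. The helper names `gauss`, `gmm_pdf` and `merge_state` and all numbers are hypothetical.

```python
import math


def gauss(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, gmm):
    """Evaluate a mixture given as a list of (weight, mean, variance)."""
    return sum(w * gauss(x, m, v) for w, m, v in gmm)


def merge_state(base, variants, lam):
    """Merge dialect-state mixtures into the mixture of one state s.

    base     : mixture of state s, list of (weight, mean, variance)
    variants : list of (p_d_given_s, mixture of dialect state d)
    lam      : interpolation coefficient, 0 < lam < 1
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    # Existing components are scaled by lam ...
    merged = [(lam * w, m, v) for w, m, v in base]
    # ... and each variant's components by (1 - lam) * p(d|s).
    for p_d_s, gmm in variants:
        merged.extend(((1.0 - lam) * p_d_s * w, m, v) for w, m, v in gmm)
    return merged


# Toy numbers, purely illustrative.
s_gmm = [(0.6, 0.0, 1.0), (0.4, 1.0, 2.0)]                            # K = 2
d1_variants = [(1.0, [(0.5, 3.0, 1.0), (0.5, 4.0, 1.0)])]             # M = 1, N = 2
d2_variants = [(0.7, [(1.0, -2.0, 1.0)]), (0.3, [(1.0, -3.0, 0.5)])]  # P = 2, Q = 1

temporary = merge_state(s_gmm, d1_variants, lam=0.8)         # plays the role of p'(x|s)
recognition = merge_state(temporary, d2_variants, lam=0.9)   # plays the role of p''(x|s)
```

After both merges the component list contains the three groups that appear in the expression for p″(x|s). The weights sum to one in this sketch only because the p(d|s) values of each variant set sum to one; if the variants were pruned, a renormalisation step would be needed.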
5. A method for modeling a common-language speech recognition under the influence of n kinds of dialects (n being an integer greater than or equal to 2), the method including: (1) generating a triphone standard common-language model based on standard common-language training data, and generating first to n-th monophone dialectal-accented common-language models based on development data of dialectal-accented common languages of first to n-th kinds, respectively; (2) generating a first confusion matrix by recognizing the development data of the dialectal-accented common language of first kind using the standard common-language model, and obtaining a first temporary merged model in a manner that the first dialectal-accented common-language model is merged into the standard common-language model according to the first confusion matrix; and (3) generating an i-th confusion matrix by recognizing the development data of the dialectal-accented common language of i-th kind using an (i−1)th temporary merged model (i being an integer such that 2≤i≤n), and obtaining a recognition model by repeating, from i=2 to i=n, an operation of merging the i-th dialectal-accented common-language model into the (i−1)th temporary merged model according to the i-th confusion matrix.
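The repetition recited in steps (2) and (3) can be read as a simple loop in which each confusion matrix is produced with the most recent merged model. The Python sketch below shows only that control flow. The callables `make_confusion` and `merge` are hypothetical placeholders for the recognition and merging operations, and the toy stand-ins at the end exist solely to make the sketch runnable.

```python
def build_recognition_model(standard_model, dialect_models, dev_sets,
                            make_confusion, merge):
    """Merge n dialect models into the standard model, one at a time.

    make_confusion(model, dev_data) and merge(model, dialect_model,
    confusion) are supplied by the caller (hypothetical interfaces).
    """
    if len(dialect_models) != len(dev_sets):
        raise ValueError("one development set is needed per dialect model")
    if len(dialect_models) < 2:
        raise ValueError("n must be greater than or equal to 2")

    merged = standard_model
    for dialect_model, dev_data in zip(dialect_models, dev_sets):
        # The i-th confusion matrix comes from the (i-1)th merged model;
        # for i = 1 that is the standard model itself.
        confusion = make_confusion(merged, dev_data)
        merged = merge(merged, dialect_model, confusion)
    return merged


# Toy stand-ins: a "model" is just a list of names.
def toy_confusion(model, dev_data):
    return (tuple(model), dev_data)


def toy_merge(model, dialect_model, confusion):
    return model + [dialect_model]


result = build_recognition_model(
    ["standard"],
    ["dialect1", "dialect2", "dialect3"],
    ["dev1", "dev2", "dev3"],
    toy_confusion,
    toy_merge,
)
```

Because one confusion routine and one merge routine serve every iteration, the same sketch also hints at how a single unit could be reused for successive dialects.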
6. A computer readable medium encoded with a program, executable by a computer, for modeling a common-language speech recognition under the influence of n kinds of dialects (n being an integer greater than or equal to 2), the program including the functions of: (1) generating a triphone standard common-language model based on standard common-language training data, and generating first to n-th monophone dialectal-accented common-language models based on development data of dialectal-accented common languages of first to n-th kinds, respectively; (2) generating a first confusion matrix by recognizing the development data of the dialectal-accented common language of first kind using the standard common-language model, and obtaining a first temporary merged model in a manner that the first dialectal-accented common-language model is merged into the standard common-language model according to the first confusion matrix; and (3) generating an i-th confusion matrix by recognizing the development data of the dialectal-accented common language of i-th kind using an (i−1)th temporary merged model (i being an integer such that 2≤i≤n), and obtaining a recognition model by repeating, from i=2 to i=n, an operation of merging the i-th dialectal-accented common-language model into the (i−1)th temporary merged model according to the i-th confusion matrix.
7. A system for modeling a common-language speech recognition under the influence of a plurality of dialects, the system including a model generating unit and a control unit for controlling the entire operation of the model generating unit, the model generating unit including: a standard common-language training database which stores training data of standard common language; a first development database which stores development data of dialectal-accented common language of first kind and a second development database which stores development data of dialectal-accented common language of second kind; a standard common-language model generator which generates a triphone standard common-language model based on the training data of standard common language stored in the standard common-language training database; a first model generator which generates a first monophone dialectal-accented common-language model based on the development data of dialectal-accented common language of first kind stored in the first development database, and a second model generator which generates a second monophone dialectal-accented common-language model based on the development data of dialectal-accented common language of second kind stored in the second development database; a first confusion matrix generator which generates a first confusion matrix in a manner that the development data of the dialectal-accented common language of first kind stored in the first development database are recognized using the standard common-language model generated by the standard common-language model generator; a first model merging unit which generates a temporary merged model in a manner that the first dialectal-accented common-language model generated by the first model generator is merged into the standard common-language model generated by the standard common-language model generator according to the first confusion matrix generated by the first confusion matrix generator; a second confusion matrix generator which generates a second confusion matrix in a manner that the development data of the dialectal-accented common language of second kind stored in the second development database are recognized using the temporary merged model generated by the first model merging unit; and a second model merging unit which generates a recognition model in a manner that the second dialectal-accented common-language model generated by the second model generator is merged into the temporary merged model generated by the first model merging unit according to the second confusion matrix generated by the second confusion matrix generator.
8. A modeling system according to claim 7, wherein a probability density function of the temporary merged model is expressed by

$$p'(x \mid s) = \lambda_1\, p(x \mid s) + (1 - \lambda_1)\, p(x \mid d_1)\, p(d_1 \mid s)$$

where x is an observation feature vector of voice to be recognized, s is a hidden Markov state in the standard common-language model, d₁ is a hidden Markov state in the first dialectal-accented common-language model, and λ₁ is a linear interpolating coefficient such that 0<λ₁<1, and wherein a probability density function of the recognition model is expressed by

$$p''(x \mid s) = \sum_{k=1}^{K} w_{k}^{(sc)'} N_{k}^{(sc)}(\cdot) + \sum_{m=1}^{M} \sum_{n=1}^{N} w_{mn}^{(dc1)'} N_{mn}^{(dc1)}(\cdot) + \sum_{p=1}^{P} \sum_{q=1}^{Q} w_{pq}^{(dc2)'} N_{pq}^{(dc2)}(\cdot)$$

where $w_{k}^{(sc)'}$ is a mixture weight for the hidden Markov state of the standard common-language model, $w_{mn}^{(dc1)'}$ is a mixture weight for the hidden Markov state of the first dialectal-accented common-language model, $w_{pq}^{(dc2)'}$ is a mixture weight for the hidden Markov state of the second dialectal-accented common-language model, K is the number of Gaussian mixtures for Hidden Markov Model state s in the standard common-language model, $N_{k}^{(sc)}(\cdot)$ is an element of Gaussian mixture for Hidden Markov Model state s, M is the number of d₁ that is considered as the pronunciation variants occurring between the first dialectal-accented common-language model for d₁ and the standard common-language model, N is the number of Gaussian mixtures for Hidden Markov Model state d₁ in the first dialectal-accented common-language model, $N_{mn}^{(dc1)}(\cdot)$ is an element of Gaussian mixture for Hidden Markov Model state d₁, P is the number of d₂ that is considered as the pronunciation variants occurring between the second dialectal-accented model for d₂ and the standard common-language model, Q is the number of Gaussian mixtures for Hidden Markov Model state d₂ in the second dialectal-accented model, and $N_{pq}^{(dc2)}(\cdot)$ is an element of Gaussian mixture for Hidden Markov Model state d₂.
9. A system for modeling a common-language speech recognition under the influence of a plurality of dialects, wherein at least one set of units among the first and second model generators, the first and second confusion matrix generators, and the first and second model merging units according to claim 7 is of a single structure and is used in a time-sharing manner.